Predicting Price with Size, Location, and Neighborhood
In this project the goal is to predict apartment prices in Buenos Aires, Argentina, using a linear regression model. A few rules must be followed: the analysis covers only apartments in Capital Federal priced under 400,000 USD. First, the imports:
import warnings
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from glob import glob
from category_encoders import OneHotEncoder
from ipywidgets import Dropdown, FloatSlider, IntSlider, interact
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
warnings.simplefilter(action="ignore", category=FutureWarning)
Checking the shape, info, and head to get a sense of the kind of data I will be working with.
df1 = pd.read_csv("data-1.csv", encoding="ISO-8859-1")
df1.shape
(8606, 16)
df1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 8606 entries, 0 to 8605 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 operation 8606 non-null object 1 property_type 8606 non-null object 2 place_with_parent_names 8606 non-null object 3 lat-lon 6936 non-null object 4 price 7590 non-null float64 5 currency 7590 non-null object 6 price_aprox_local_currency 7590 non-null float64 7 price_aprox_usd 7590 non-null float64 8 surface_total_in_m2 5946 non-null float64 9 surface_covered_in_m2 7268 non-null float64 10 price_usd_per_m2 4895 non-null float64 11 price_per_m2 6520 non-null float64 12 floor 1259 non-null float64 13 rooms 4752 non-null float64 14 expenses 875 non-null object 15 properati_url 8606 non-null object dtypes: float64(9), object(7) memory usage: 1.1+ MB
df1.head()
| operation | property_type | place_with_parent_names | lat-lon | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_usd_per_m2 | price_per_m2 | floor | rooms | expenses | properati_url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sell | apartment | |Argentina|Capital Federal|Villa Crespo| | -34.6047834183,-58.4586812499 | 180000.0 | USD | 2729232.0 | 180000.0 | 120.0 | 110.0 | 1500.000000 | 1636.363636 | NaN | 4.0 | NaN | http://villa-crespo.properati.com.ar/12egq_ven... |
| 1 | sell | house | |Argentina|Bs.As. G.B.A. Zona Oeste|La Matanza... | NaN | 250000.0 | USD | 3790600.0 | 250000.0 | 117.0 | 120.0 | 2136.752137 | 2083.333333 | NaN | 4.0 | NaN | http://ramos-mejia.properati.com.ar/s7pd_venta... |
| 2 | sell | house | |Argentina|Bs.As. G.B.A. Zona Oeste|Morón|Cast... | -34.6497002,-58.658073 | 410000.0 | USD | 6216584.0 | 410000.0 | 410.0 | 220.0 | 1000.000000 | 1863.636364 | NaN | NaN | NaN | http://castelar-moron.properati.com.ar/11vgn_v... |
| 3 | sell | house | |Argentina|Bs.As. G.B.A. Zona Oeste|Tres de Fe... | -34.5957086,-58.5669503 | 180000.0 | USD | 2729232.0 | 180000.0 | 200.0 | 135.0 | 900.000000 | 1333.333333 | NaN | 5.0 | NaN | http://tres-de-febrero.properati.com.ar/7f7u_v... |
| 4 | sell | apartment | |Argentina|Capital Federal|Chacarita| | -34.5846508988,-58.4546932614 | 129000.0 | USD | 1955949.6 | 129000.0 | 76.0 | 70.0 | 1697.368421 | 1842.857143 | NaN | NaN | NaN | http://chacarita.properati.com.ar/10qlv_venta_... |
At first glance there are a few problems that need to be taken care of. Since location will be one of the predictors, the NaN values in the "lat-lon" column must be dealt with. For now I will explore the data further.
There are several files with similar information, so instead of processing them one by one it would be better to build a wrangle function that handles all the steps at once.
def wrangle(filepath):
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced under 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of the CSV file.
    """
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")
    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to apartments
    mask_apt = df["property_type"] == "apartment"
    # Subset to properties priced under 400,000 USD
    mask_price = df["price_aprox_usd"] < 400_000
    # Apply all masks
    df = df[mask_ba & mask_apt & mask_price]
    # Return the new DataFrame
    return df
df = wrangle("data-1.csv")
df.shape
(1781, 16)
Size is usually an important factor in determining the price of an apartment. With that in mind, it is important to look at the distribution of apartment sizes in the dataset.
# Plotting a histogram related to the price per m2
plt.hist(df["surface_covered_in_m2"])
# Detailing title and axis, ';' to suppress any text output
plt.title("Distribution of Apartment Sizes")
plt.xlabel("Area [sq meters]")
plt.ylabel("Frequency");
The dataset seems to have some extreme outliers. These can hurt model performance, especially for linear models like the one planned here. To confirm, the describe method will come in handy.
df.describe()
| price | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_usd_per_m2 | price_per_m2 | floor | rooms | |
|---|---|---|---|---|---|---|---|---|---|
| count | 1.781000e+03 | 1.781000e+03 | 1781.000000 | 1247.000000 | 1635.000000 | 1198.000000 | 1618.000000 | 482.000000 | 1385.000000 |
| mean | 2.285410e+05 | 2.178037e+06 | 143647.215239 | 114.417001 | 97.877064 | 2231.534726 | 4340.345113 | 10.085062 | 2.434657 |
| std | 4.100189e+05 | 1.141710e+06 | 75298.758506 | 1755.263895 | 1533.057610 | 650.139587 | 8126.940810 | 44.783320 | 1.065193 |
| min | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.224587 | 1.000000 | 1.000000 |
| 25% | 9.000000e+04 | 1.349454e+06 | 89000.000000 | 40.000000 | 38.000000 | 1815.485830 | 1973.750000 | 2.000000 | 2.000000 |
| 50% | 1.300000e+05 | 1.849813e+06 | 122000.000000 | 55.000000 | 50.000000 | 2151.642336 | 2411.239496 | 4.000000 | 2.000000 |
| 75% | 1.911340e+05 | 2.653420e+06 | 175000.000000 | 79.000000 | 73.000000 | 2577.003583 | 2958.157462 | 7.000000 | 3.000000 |
| max | 5.487000e+06 | 6.034635e+06 | 398000.000000 | 62034.000000 | 62034.000000 | 5975.609756 | 57156.250000 | 616.000000 | 6.000000 |
The output will be easier to read with the scientific notation on the big numbers removed:
df.describe().apply(lambda s: s.apply('{0:.5f}'.format))
| price | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_usd_per_m2 | price_per_m2 | floor | rooms | |
|---|---|---|---|---|---|---|---|---|---|
| count | 1781.00000 | 1781.00000 | 1781.00000 | 1247.00000 | 1635.00000 | 1198.00000 | 1618.00000 | 482.00000 | 1385.00000 |
| mean | 228541.04380 | 2178036.53577 | 143647.21524 | 114.41700 | 97.87706 | 2231.53473 | 4340.34511 | 10.08506 | 2.43466 |
| std | 410018.91819 | 1141709.89618 | 75298.75851 | 1755.26390 | 1533.05761 | 650.13959 | 8126.94081 | 44.78332 | 1.06519 |
| min | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 2.22459 | 1.00000 | 1.00000 |
| 25% | 90000.00000 | 1349453.60000 | 89000.00000 | 40.00000 | 38.00000 | 1815.48583 | 1973.75000 | 2.00000 | 2.00000 |
| 50% | 130000.00000 | 1849812.80000 | 122000.00000 | 55.00000 | 50.00000 | 2151.64234 | 2411.23950 | 4.00000 | 2.00000 |
| 75% | 191134.00000 | 2653420.00000 | 175000.00000 | 79.00000 | 73.00000 | 2577.00358 | 2958.15746 | 7.00000 | 3.00000 |
| max | 5487000.00000 | 6034635.20000 | 398000.00000 | 62034.00000 | 62034.00000 | 5975.60976 | 57156.25000 | 616.00000 | 6.00000 |
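An alternative to formatting every cell by hand is pandas' display option; a minimal sketch on toy numbers (not the real listings):

```python
import pandas as pd

# A minimal sketch on toy data: switch off scientific notation for every
# float that pandas displays, instead of formatting each cell by hand.
toy = pd.DataFrame({"price": [5_487_000.0, 90_000.0]})
with pd.option_context("display.float_format", "{:.2f}".format):
    print(toy.describe())
```

Using `option_context` keeps the change local to the block, so the rest of the notebook's output is unaffected.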
The statistics above confirm it. While most of the apartments in the dataset are smaller than 73 square meters, some are several thousand square meters. The best fix is to update the wrangle function to remove them from the dataset. (I will repeat the full wrangle function every time I update it to help with visualization and understanding; in a real project I would write it only once with all the updates.)
def wrangle(filepath):
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced under 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of the CSV file.
    """
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")
    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to apartments
    mask_apt = df["property_type"] == "apartment"
    # Subset to properties priced under 400,000 USD
    mask_price = df["price_aprox_usd"] < 400_000
    # Apply all masks
    df = df[mask_ba & mask_apt & mask_price]
    # Remove outliers by "surface_covered_in_m2" (keep the middle 80%)
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]
    # Return the new DataFrame
    return df
df = wrangle("data-1.csv")
df.shape
(1343, 16)
# Plotting a histogram related to the price per m2
plt.hist(df["surface_covered_in_m2"])
# Detailing title and axis, ';' to suppress any text output
plt.title("Distribution of Apartment Sizes")
plt.xlabel("Area [sq meters]")
plt.ylabel("Frequency");
Without the massive outliers the histogram looks much better.
Now I will check whether there is a clear relationship between apartment size and price. A scatter plot is appropriate:
# Plotting a scatter plot
plt.scatter(df["surface_covered_in_m2"], df["price_aprox_usd"])
# Detailing title and axis, ';' to suppress any text output
plt.title("Area vs. Price")
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]");
This plot suggests a moderate positive correlation between apartment price and size. This means that if price is the goal of the prediction, size will be a useful feature to include.
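The scatter plot's story can also be checked numerically with `Series.corr`, which returns the Pearson correlation coefficient. A sketch on toy size/price pairs; on the real data the same call would be `df["surface_covered_in_m2"].corr(df["price_aprox_usd"])`:

```python
import pandas as pd

# Toy size/price pairs that are roughly linear, standing in for the
# real "surface_covered_in_m2" and "price_aprox_usd" columns.
size = pd.Series([38.0, 50.0, 73.0, 90.0, 110.0])
price = pd.Series([89_000.0, 122_000.0, 175_000.0, 210_000.0, 260_000.0])
r = size.corr(price)
print(round(r, 3))  # high for this roughly linear toy data
```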
Since apartment location, specifically latitude and longitude, will be part of the final prediction, the "lat-lon" column needs to be split. The info output also shows that its data type is object; to build a model, latitude and longitude must each be in their own column with a float data type.
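Before folding it into the wrangle function, a minimal sketch of the split step on a toy frame:

```python
import pandas as pd

# str.split with expand=True turns one comma-separated "lat-lon" string
# into two columns; astype(float) then fixes the dtype (toy coordinates).
toy = pd.DataFrame({"lat-lon": ["-34.6047834183,-58.4586812499"]})
toy[["lat", "lon"]] = toy["lat-lon"].str.split(",", expand=True).astype(float)
print(toy[["lat", "lon"]].dtypes)  # both float64
```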
def wrangle(filepath):
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced under 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of the CSV file.
    """
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")
    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to apartments
    mask_apt = df["property_type"] == "apartment"
    # Subset to properties priced under 400,000 USD
    mask_price = df["price_aprox_usd"] < 400_000
    # Apply all masks
    df = df[mask_ba & mask_apt & mask_price]
    # Remove outliers by "surface_covered_in_m2" (keep the middle 80%)
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area].copy()
    # Split the "lat-lon" column into float "lat" and "lon" columns
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)
    # Return the new DataFrame
    return df
df = wrangle("data-1.csv")
df.head()
| operation | property_type | place_with_parent_names | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_usd_per_m2 | price_per_m2 | floor | rooms | expenses | properati_url | lat | lon | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | sell | apartment | |Argentina|Capital Federal|Chacarita| | 129000.0 | USD | 1955949.6 | 129000.0 | 76.0 | 70.0 | 1697.368421 | 1842.857143 | NaN | NaN | NaN | http://chacarita.properati.com.ar/10qlv_venta_... | -34.584651 | -58.454693 |
| 9 | sell | apartment | |Argentina|Capital Federal|Villa Luro| | 87000.0 | USD | 1319128.8 | 87000.0 | 48.0 | 42.0 | 1812.500000 | 2071.428571 | NaN | NaN | NaN | http://villa-luro.properati.com.ar/12m82_venta... | -34.638979 | -58.500115 |
| 29 | sell | apartment | |Argentina|Capital Federal|Caballito| | 118000.0 | USD | 1789163.2 | 118000.0 | NaN | 54.0 | NaN | 2185.185185 | NaN | 2.0 | NaN | http://caballito.properati.com.ar/11wqh_venta_... | -34.615847 | -58.459957 |
| 40 | sell | apartment | |Argentina|Capital Federal|Constitución| | 57000.0 | USD | 864256.8 | 57000.0 | 42.0 | 42.0 | 1357.142857 | 1357.142857 | 5.0 | 2.0 | 364 | http://constitucion.properati.com.ar/k2f0_vent... | -34.625222 | -58.382382 |
| 41 | sell | apartment | |Argentina|Capital Federal|Once| | 90000.0 | USD | 1364616.0 | 90000.0 | 57.0 | 50.0 | 1578.947368 | 1800.000000 | NaN | 3.0 | 450 | http://once.properati.com.ar/suwa_venta_depart... | -34.610610 | -58.412511 |
I will plot a map using plotly to visualize how well the DataFrame covers the apartments and their prices:
fig = px.scatter_mapbox(
df, # DataFrame
lat='lat',
lon='lon',
width=600, # Width of map
height=600, # Height of map
color='price_aprox_usd',
hover_data=["price_aprox_usd"], # Display price when hovering mouse over house
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
Looks fine, but there are more listings in the other files. Using glob, I will collect every CSV in the data folder and concatenate them:
files = glob("data/data-*.csv")
files
['data\\data-1.csv', 'data\\data-2.csv', 'data\\data-3.csv', 'data\\data-4.csv', 'data\\data-5.csv']
frames = []
for file in files:
frames.append(wrangle(file))
len(frames)
5
df = pd.concat(frames)
df.head()
| operation | property_type | place_with_parent_names | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_usd_per_m2 | price_per_m2 | floor | rooms | expenses | properati_url | lat | lon | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | sell | apartment | |Argentina|Capital Federal|Chacarita| | 129000.0 | USD | 1955949.6 | 129000.0 | 76.0 | 70.0 | 1697.368421 | 1842.857143 | NaN | NaN | NaN | http://chacarita.properati.com.ar/10qlv_venta_... | -34.584651 | -58.454693 |
| 9 | sell | apartment | |Argentina|Capital Federal|Villa Luro| | 87000.0 | USD | 1319128.8 | 87000.0 | 48.0 | 42.0 | 1812.500000 | 2071.428571 | NaN | NaN | NaN | http://villa-luro.properati.com.ar/12m82_venta... | -34.638979 | -58.500115 |
| 29 | sell | apartment | |Argentina|Capital Federal|Caballito| | 118000.0 | USD | 1789163.2 | 118000.0 | NaN | 54.0 | NaN | 2185.185185 | NaN | 2.0 | NaN | http://caballito.properati.com.ar/11wqh_venta_... | -34.615847 | -58.459957 |
| 40 | sell | apartment | |Argentina|Capital Federal|Constitución| | 57000.0 | USD | 864256.8 | 57000.0 | 42.0 | 42.0 | 1357.142857 | 1357.142857 | 5.0 | 2.0 | 364 | http://constitucion.properati.com.ar/k2f0_vent... | -34.625222 | -58.382382 |
| 41 | sell | apartment | |Argentina|Capital Federal|Once| | 90000.0 | USD | 1364616.0 | 90000.0 | 57.0 | 50.0 | 1578.947368 | 1800.000000 | NaN | 3.0 | 450 | http://once.properati.com.ar/suwa_venta_depart... | -34.610610 | -58.412511 |
fig = px.scatter_mapbox(
df, # DataFrame
lat='lat',
lon='lon',
width=600, # Width of map
height=600, # Height of map
color='price_aprox_usd',
hover_data=["price_aprox_usd"], # Display price when hovering mouse over house
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
Just by looking at the map it is already noticeable that the closer an apartment is to the coast, the more expensive it gets. That also reminded me that the neighborhood usually plays an important role in an apartment's price. I will investigate whether that assumption holds. In the available data the place_with_parent_names column carries three pieces of information, Country | City | Neighborhood, and since the latter is what matters now, I will create a column for it.
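The index arithmetic behind the extraction is worth a quick check: the place string starts and ends with "|", so splitting yields empty strings at both ends and the neighborhood lands at position 3.

```python
# Splitting a pipe-delimited place string from the dataset: the leading
# and trailing "|" produce empty strings, shifting the neighborhood to
# index 3.
place = "|Argentina|Capital Federal|Chacarita|"
parts = place.split("|")
print(parts)     # ['', 'Argentina', 'Capital Federal', 'Chacarita', '']
print(parts[3])  # Chacarita
```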
def wrangle(filepath):
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced under 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of the CSV file.
    """
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")
    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to apartments
    mask_apt = df["property_type"] == "apartment"
    # Subset to properties priced under 400,000 USD
    mask_price = df["price_aprox_usd"] < 400_000
    # Apply all masks
    df = df[mask_ba & mask_apt & mask_price]
    # Remove outliers by "surface_covered_in_m2" (keep the middle 80%)
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area].copy()
    # Split the "lat-lon" column into float "lat" and "lon" columns
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)
    # Create a "neighborhood" column from the pipe-delimited place string
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    # Return the new DataFrame
    return df
df = wrangle("data-1.csv")
df.head()
| operation | property_type | place_with_parent_names | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_usd_per_m2 | price_per_m2 | floor | rooms | expenses | properati_url | lat | lon | neighborhood | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | sell | apartment | |Argentina|Capital Federal|Chacarita| | 129000.0 | USD | 1955949.6 | 129000.0 | 76.0 | 70.0 | 1697.368421 | 1842.857143 | NaN | NaN | NaN | http://chacarita.properati.com.ar/10qlv_venta_... | -34.584651 | -58.454693 | Chacarita |
| 9 | sell | apartment | |Argentina|Capital Federal|Villa Luro| | 87000.0 | USD | 1319128.8 | 87000.0 | 48.0 | 42.0 | 1812.500000 | 2071.428571 | NaN | NaN | NaN | http://villa-luro.properati.com.ar/12m82_venta... | -34.638979 | -58.500115 | Villa Luro |
| 29 | sell | apartment | |Argentina|Capital Federal|Caballito| | 118000.0 | USD | 1789163.2 | 118000.0 | NaN | 54.0 | NaN | 2185.185185 | NaN | 2.0 | NaN | http://caballito.properati.com.ar/11wqh_venta_... | -34.615847 | -58.459957 | Caballito |
| 40 | sell | apartment | |Argentina|Capital Federal|Constitución| | 57000.0 | USD | 864256.8 | 57000.0 | 42.0 | 42.0 | 1357.142857 | 1357.142857 | 5.0 | 2.0 | 364 | http://constitucion.properati.com.ar/k2f0_vent... | -34.625222 | -58.382382 | Constitución |
| 41 | sell | apartment | |Argentina|Capital Federal|Once| | 90000.0 | USD | 1364616.0 | 90000.0 | 57.0 | 50.0 | 1578.947368 | 1800.000000 | NaN | 3.0 | 450 | http://once.properati.com.ar/suwa_venta_depart... | -34.610610 | -58.412511 | Once |
df.groupby("neighborhood")[["price", "surface_covered_in_m2"]].mean().sort_values(by="price", ascending=False)
| price | surface_covered_in_m2 | |
|---|---|---|
| neighborhood | ||
| Parque Chacabuco | 597018.833333 | 69.000000 |
| Barracas | 491165.064516 | 54.387097 |
| Boedo | 453569.750000 | 50.250000 |
| Saavedra | 331669.558824 | 51.970588 |
| San Cristobal | 331016.729730 | 51.351351 |
| Colegiales | 317316.571429 | 57.380952 |
| Belgrano | 314852.141732 | 57.543307 |
| San Telmo | 298507.484848 | 54.757576 |
| Nuñez | 296303.395833 | 55.583333 |
| Flores | 287899.000000 | 57.500000 |
| Centro / Microcentro | 285826.100000 | 49.500000 |
| Caballito | 274527.260000 | 50.110000 |
| Puerto Madero | 263619.545455 | 56.090909 |
| Constitución | 248722.222222 | 51.777778 |
| Las Cañitas | 228626.875000 | 62.750000 |
| Parque Centenario | 217940.000000 | 48.700000 |
| Recoleta | 208590.800000 | 68.766667 |
| Congreso | 208125.000000 | 57.375000 |
| Coghlan | 194600.600000 | 40.300000 |
| Barrio Norte | 183231.763158 | 64.052632 |
| Villa Luro | 182166.875000 | 59.250000 |
| Palermo | 179831.396825 | 53.293651 |
| Villa Urquiza | 172493.882353 | 52.397059 |
| Almagro | 160323.044444 | 54.000000 |
| Retiro | 155400.000000 | 58.833333 |
| Monserrat | 143211.538462 | 49.384615 |
| Balvanera | 141458.622222 | 55.488889 |
| Villa Crespo | 140503.931818 | 54.000000 |
| Chacarita | 138555.555556 | 57.222222 |
| Abasto | 136000.000000 | 64.000000 |
| Monte Castro | 129316.666667 | 65.000000 |
| Villa General Mitre | 124000.000000 | 69.333333 |
| Villa Devoto | 120536.411765 | 53.882353 |
| Tribunales | 120000.000000 | 47.000000 |
| Villa Pueyrredón | 118163.333333 | 56.666667 |
| Floresta | 117876.470588 | 65.058824 |
| Versalles | 116000.000000 | 62.000000 |
| Once | 115275.000000 | 65.000000 |
| Villa del Parque | 113889.866667 | 54.466667 |
| Paternal | 113065.500000 | 50.812500 |
| Parque Patricios | 113033.333333 | 53.500000 |
| San Nicolás | 111692.307692 | 57.153846 |
| Liniers | 110714.285714 | 62.571429 |
| Villa Ortuzar | 104000.000000 | 44.500000 |
| Parque Chas | 103722.333333 | 45.333333 |
| 102375.000000 | 45.812500 | |
| Agronomía | 101250.000000 | 39.250000 |
| Mataderos | 99613.888889 | 54.888889 |
| Boca | 98666.666667 | 59.200000 |
| Velez Sarsfield | 92000.000000 | 52.000000 |
| Parque Avellaneda | 82400.000000 | 49.200000 |
| Villa Santa Rita | 81683.333333 | 43.000000 |
| Villa Lugano | 75533.333333 | 63.833333 |
Even though the average sizes are similar, the average prices are not: some neighborhoods are clearly more prestigious than others.
Beyond these more obvious factors, the DataFrame has more features whose value is trickier to judge. So for now, I will keep all of them.
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1343 entries, 4 to 8604 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 operation 1343 non-null object 1 property_type 1343 non-null object 2 place_with_parent_names 1343 non-null object 3 price 1343 non-null float64 4 currency 1343 non-null object 5 price_aprox_local_currency 1343 non-null float64 6 price_aprox_usd 1343 non-null float64 7 surface_total_in_m2 965 non-null float64 8 surface_covered_in_m2 1343 non-null float64 9 price_usd_per_m2 927 non-null float64 10 price_per_m2 1343 non-null float64 11 floor 379 non-null float64 12 rooms 1078 non-null float64 13 expenses 349 non-null object 14 properati_url 1343 non-null object 15 lat 1300 non-null float64 16 lon 1300 non-null float64 17 neighborhood 1343 non-null object dtypes: float64(11), object(7) memory usage: 199.4+ KB
The more information I have, the better, since I can impute missing values; however, there still needs to be enough data in a column for a good imputation. So, if more than half of a column is missing I will drop it, and impute the rest.
df.isnull().sum() / len(df)
operation 0.000000 property_type 0.000000 place_with_parent_names 0.000000 price 0.000000 currency 0.000000 price_aprox_local_currency 0.000000 price_aprox_usd 0.000000 surface_total_in_m2 0.281459 surface_covered_in_m2 0.000000 price_usd_per_m2 0.309754 price_per_m2 0.000000 floor 0.717796 rooms 0.197319 expenses 0.740134 properati_url 0.000000 lat 0.032018 lon 0.032018 neighborhood 0.000000 dtype: float64
Floor and expenses have far too many NaN values, so they definitely need to go.
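The same 50% rule can be applied programmatically rather than by eye; a sketch on a toy frame (on the real data this would flag "floor" and "expenses"):

```python
import numpy as np
import pandas as pd

# isnull().mean() gives the fraction of missing values per column;
# anything above 0.5 is a candidate for dropping.
toy = pd.DataFrame({
    "floor": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "rooms": [2.0, 3.0, np.nan, 2.0],        # 25% missing
})
to_drop = toy.columns[toy.isnull().mean() > 0.5].to_list()
print(to_drop)  # ['floor']
```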
def wrangle(filepath):
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced under 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of the CSV file.
    """
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")
    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to apartments
    mask_apt = df["property_type"] == "apartment"
    # Subset to properties priced under 400,000 USD
    mask_price = df["price_aprox_usd"] < 400_000
    # Apply all masks
    df = df[mask_ba & mask_apt & mask_price]
    # Remove outliers by "surface_covered_in_m2" (keep the middle 80%)
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area].copy()
    # Split the "lat-lon" column into float "lat" and "lon" columns
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)
    # Create a "neighborhood" column from the pipe-delimited place string
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    # Drop columns with more than half of their values missing
    df.drop(columns=["floor", "expenses"], inplace=True)
    # Return the new DataFrame
    return df
df = wrangle("data-1.csv")
df.head()
| operation | property_type | place_with_parent_names | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_usd_per_m2 | price_per_m2 | rooms | properati_url | lat | lon | neighborhood | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | sell | apartment | |Argentina|Capital Federal|Chacarita| | 129000.0 | USD | 1955949.6 | 129000.0 | 76.0 | 70.0 | 1697.368421 | 1842.857143 | NaN | http://chacarita.properati.com.ar/10qlv_venta_... | -34.584651 | -58.454693 | Chacarita |
| 9 | sell | apartment | |Argentina|Capital Federal|Villa Luro| | 87000.0 | USD | 1319128.8 | 87000.0 | 48.0 | 42.0 | 1812.500000 | 2071.428571 | NaN | http://villa-luro.properati.com.ar/12m82_venta... | -34.638979 | -58.500115 | Villa Luro |
| 29 | sell | apartment | |Argentina|Capital Federal|Caballito| | 118000.0 | USD | 1789163.2 | 118000.0 | NaN | 54.0 | NaN | 2185.185185 | 2.0 | http://caballito.properati.com.ar/11wqh_venta_... | -34.615847 | -58.459957 | Caballito |
| 40 | sell | apartment | |Argentina|Capital Federal|Constitución| | 57000.0 | USD | 864256.8 | 57000.0 | 42.0 | 42.0 | 1357.142857 | 1357.142857 | 2.0 | http://constitucion.properati.com.ar/k2f0_vent... | -34.625222 | -58.382382 | Constitución |
| 41 | sell | apartment | |Argentina|Capital Federal|Once| | 90000.0 | USD | 1364616.0 | 90000.0 | 57.0 | 50.0 | 1578.947368 | 1800.000000 | 3.0 | http://once.properati.com.ar/suwa_venta_depart... | -34.610610 | -58.412511 | Once |
My next step is to deal with low- and high-cardinality categorical columns. If there is only one category in a column, it provides no unique information to the model. At the other extreme, columns where nearly every row has its own category will not help the model identify useful trends in the data.
# Select all object-typed columns, i.e. the categorical candidates
df.select_dtypes("object").head()
| operation | property_type | place_with_parent_names | currency | properati_url | neighborhood | |
|---|---|---|---|---|---|---|
| 4 | sell | apartment | |Argentina|Capital Federal|Chacarita| | USD | http://chacarita.properati.com.ar/10qlv_venta_... | Chacarita |
| 9 | sell | apartment | |Argentina|Capital Federal|Villa Luro| | USD | http://villa-luro.properati.com.ar/12m82_venta... | Villa Luro |
| 29 | sell | apartment | |Argentina|Capital Federal|Caballito| | USD | http://caballito.properati.com.ar/11wqh_venta_... | Caballito |
| 40 | sell | apartment | |Argentina|Capital Federal|Constitución| | USD | http://constitucion.properati.com.ar/k2f0_vent... | Constitución |
| 41 | sell | apartment | |Argentina|Capital Federal|Once| | USD | http://once.properati.com.ar/suwa_venta_depart... | Once |
# Counting how many unique objects there are in those columns
df.select_dtypes("object").nunique()
operation 1 property_type 1 place_with_parent_names 53 currency 2 properati_url 1343 neighborhood 53 dtype: int64
len(df)
1343
Looking at those values it is easy to see that operation, property_type, and currency have only one or two values. On the other hand, properati_url has a different value for every row in the DataFrame. Updating the wrangle function is the best approach.
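The same cardinality screen can be done in code: flag object columns with a single category (no information) or one category per row (no generalizable pattern). A sketch on a toy frame mirroring the situation above:

```python
import pandas as pd

# nunique() counts distinct values per object column; columns with one
# category, or as many categories as rows, are dropped.
toy = pd.DataFrame({
    "operation": ["sell", "sell", "sell", "sell"],
    "properati_url": ["u1", "u2", "u3", "u4"],
    "neighborhood": ["Once", "Once", "Flores", "Boedo"],
})
counts = toy.select_dtypes("object").nunique()
to_drop = counts[(counts == 1) | (counts == len(toy))].index.to_list()
print(to_drop)  # ['operation', 'properati_url']
```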
def wrangle(filepath):
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced under 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of the CSV file.
    """
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")
    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to apartments
    mask_apt = df["property_type"] == "apartment"
    # Subset to properties priced under 400,000 USD
    mask_price = df["price_aprox_usd"] < 400_000
    # Apply all masks
    df = df[mask_ba & mask_apt & mask_price]
    # Remove outliers by "surface_covered_in_m2" (keep the middle 80%)
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area].copy()
    # Split the "lat-lon" column into float "lat" and "lon" columns
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)
    # Create a "neighborhood" column from the pipe-delimited place string
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    # Drop columns with more than half of their values missing
    df.drop(columns=["floor", "expenses"], inplace=True)
    # Drop low- and high-cardinality categorical columns
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)
    # Return the new DataFrame
    return df
df = wrangle("data-1.csv")
df.head()
| place_with_parent_names | price | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_usd_per_m2 | price_per_m2 | rooms | lat | lon | neighborhood | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | |Argentina|Capital Federal|Chacarita| | 129000.0 | 1955949.6 | 129000.0 | 76.0 | 70.0 | 1697.368421 | 1842.857143 | NaN | -34.584651 | -58.454693 | Chacarita |
| 9 | |Argentina|Capital Federal|Villa Luro| | 87000.0 | 1319128.8 | 87000.0 | 48.0 | 42.0 | 1812.500000 | 2071.428571 | NaN | -34.638979 | -58.500115 | Villa Luro |
| 29 | |Argentina|Capital Federal|Caballito| | 118000.0 | 1789163.2 | 118000.0 | NaN | 54.0 | NaN | 2185.185185 | 2.0 | -34.615847 | -58.459957 | Caballito |
| 40 | |Argentina|Capital Federal|Constitución| | 57000.0 | 864256.8 | 57000.0 | 42.0 | 42.0 | 1357.142857 | 1357.142857 | 2.0 | -34.625222 | -58.382382 | Constitución |
| 41 | |Argentina|Capital Federal|Once| | 90000.0 | 1364616.0 | 90000.0 | 57.0 | 50.0 | 1578.947368 | 1800.000000 | 3.0 | -34.610610 | -58.412511 | Once |
Another important cleaning step is to drop any column that would constitute leakage, that is, features that were created using our target or that would give the model information it won't have access to when deployed.
df.columns.to_list()
['place_with_parent_names', 'price', 'price_aprox_local_currency', 'price_aprox_usd', 'surface_total_in_m2', 'surface_covered_in_m2', 'price_usd_per_m2', 'price_per_m2', 'rooms', 'lat', 'lon', 'neighborhood']
Analyzing the columns, it is noticeable that several of them constitute leakage: since I am trying to build a model that predicts sale prices, I cannot have price-derived values before a sale happens. So price, price_aprox_local_currency, price_usd_per_m2, and price_per_m2 must go. price_aprox_usd will not be dropped because it is the target and the model needs it during training.
def wrangle(filepath):
    """
    Read a CSV file into a DataFrame.
    Returns only apartments in Capital Federal priced under 400,000 USD.
    Parameters
    ----------
    filepath : str
        Location of the CSV file.
    """
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")
    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to apartments
    mask_apt = df["property_type"] == "apartment"
    # Subset to properties where price is less than 400,000 USD
    mask_price = df["price_aprox_usd"] < 400_000
    # Apply all masks
    df = df[mask_ba & mask_apt & mask_price]
    # Remove outliers by "surface_covered_in_m2" (keep the 10th-90th percentile)
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]
    # Split the "lat-lon" column
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)
    # Create a "neighborhood" column
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    # Drop columns with more than half null values
    df.drop(columns=["floor", "expenses"], inplace=True)
    # Drop low- and high-cardinality categorical variables
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)
    # Drop leakage
    df.drop(columns=["price", "price_aprox_local_currency", "price_per_m2", "price_usd_per_m2"], inplace=True)
    # Return the new DataFrame
    return df
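To see why index 3 picks out the neighborhood, here is the plain-Python equivalent of that str.split call: the leading and trailing | in place_with_parent_names produce empty strings at positions 0 and 4.

```python
place = "|Argentina|Capital Federal|Chacarita|"
parts = place.split("|")
print(parts)     # ['', 'Argentina', 'Capital Federal', 'Chacarita', '']
print(parts[3])  # Chacarita
```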
df = wrangle("data-1.csv")
df.head()
|  | place_with_parent_names | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | rooms | lat | lon | neighborhood |
|---|---|---|---|---|---|---|---|---|
| 4 | \|Argentina\|Capital Federal\|Chacarita\| | 129000.0 | 76.0 | 70.0 | NaN | -34.584651 | -58.454693 | Chacarita |
| 9 | \|Argentina\|Capital Federal\|Villa Luro\| | 87000.0 | 48.0 | 42.0 | NaN | -34.638979 | -58.500115 | Villa Luro |
| 29 | \|Argentina\|Capital Federal\|Caballito\| | 118000.0 | NaN | 54.0 | 2.0 | -34.615847 | -58.459957 | Caballito |
| 40 | \|Argentina\|Capital Federal\|Constitución\| | 57000.0 | 42.0 | 42.0 | 2.0 | -34.625222 | -58.382382 | Constitución |
| 41 | \|Argentina\|Capital Federal\|Once\| | 90000.0 | 57.0 | 50.0 | 3.0 | -34.610610 | -58.412511 | Once |
Finally, the last issue to keep an eye out for is multicollinearity, that is, features in the feature matrix that are highly correlated with each other. A good way to detect this is to use a heatmap.
# Get the correlation between all number type columns
corr = df.select_dtypes("number").drop(columns="price_aprox_usd").corr()
# Plot the heatmap
sns.heatmap(corr);
It seems there are three features that are highly correlated with each other: rooms, surface_covered_in_m2, and surface_total_in_m2, so ideally only one of them should be kept. To find out which one is best to keep, I will look at the data info.
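The same pairs can also be flagged numerically instead of visually. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy data: total area, covered area and rooms move together (made-up values)
toy = pd.DataFrame({
    "surface_total_in_m2": [40.0, 60.0, 80.0, 100.0],
    "surface_covered_in_m2": [35.0, 55.0, 75.0, 95.0],
    "rooms": [1, 2, 3, 4],
})
corr = toy.corr()
# Off-diagonal values near 1 signal multicollinearity
print(corr.round(2))
```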
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1343 entries, 4 to 8604 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 place_with_parent_names 1343 non-null object 1 price_aprox_usd 1343 non-null float64 2 surface_total_in_m2 965 non-null float64 3 surface_covered_in_m2 1343 non-null float64 4 rooms 1078 non-null float64 5 lat 1300 non-null float64 6 lon 1300 non-null float64 7 neighborhood 1343 non-null object dtypes: float64(6), object(2) memory usage: 94.4+ KB
Of the three, surface_covered_in_m2 is the best one to keep because it has no NaN values (1343 non-null), while surface_total_in_m2 and rooms are each missing hundreds of entries.
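The null counts above can also be read off directly with isna().sum(); a sketch on a toy frame with illustrative values mirroring the pattern:

```python
import numpy as np
import pandas as pd

# Toy frame: covered area is complete, the other two have gaps (illustrative values)
toy = pd.DataFrame({
    "surface_total_in_m2": [76.0, np.nan, np.nan, 42.0],
    "surface_covered_in_m2": [70.0, 42.0, 54.0, 42.0],
    "rooms": [np.nan, np.nan, 2.0, 2.0],
})
nulls = toy.isna().sum()
print(nulls)  # surface_covered_in_m2 has zero missing values
```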
def wrangle(filepath):
    """
    Read a CSV file into a DataFrame.
    Returns only apartments in Capital Federal priced under 400,000 USD.
    Parameters
    ----------
    filepath : str
        Location of the CSV file.
    """
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")
    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to apartments
    mask_apt = df["property_type"] == "apartment"
    # Subset to properties where price is less than 400,000 USD
    mask_price = df["price_aprox_usd"] < 400_000
    # Apply all masks
    df = df[mask_ba & mask_apt & mask_price]
    # Remove outliers by "surface_covered_in_m2" (keep the 10th-90th percentile)
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]
    # Split the "lat-lon" column
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)
    # Create a "neighborhood" column
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    # Drop columns with more than half null values
    df.drop(columns=["floor", "expenses"], inplace=True)
    # Drop low- and high-cardinality categorical variables
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)
    # Drop leakage
    df.drop(columns=["price", "price_aprox_local_currency", "price_per_m2", "price_usd_per_m2"], inplace=True)
    # Drop columns with multicollinearity
    df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)
    # Return the new DataFrame
    return df
# Gather all the data CSV files (pattern assumed; glob is imported at the top)
files = glob("data-*.csv")
frames = []
for file in files:
    frames.append(wrangle(file))
df = pd.concat(frames)
df.head()
|  | place_with_parent_names | price_aprox_usd | surface_covered_in_m2 | lat | lon | neighborhood |
|---|---|---|---|---|---|---|
| 4 | \|Argentina\|Capital Federal\|Chacarita\| | 129000.0 | 70.0 | -34.584651 | -58.454693 | Chacarita |
| 9 | \|Argentina\|Capital Federal\|Villa Luro\| | 87000.0 | 42.0 | -34.638979 | -58.500115 | Villa Luro |
| 29 | \|Argentina\|Capital Federal\|Caballito\| | 118000.0 | 54.0 | -34.615847 | -58.459957 | Caballito |
| 40 | \|Argentina\|Capital Federal\|Constitución\| | 57000.0 | 42.0 | -34.625222 | -58.382382 | Constitución |
| 41 | \|Argentina\|Capital Federal\|Once\| | 90000.0 | 50.0 | -34.610610 | -58.412511 | Once |
Now it is time to build the model. First, I will split the data.
# The target is the aproximate price
target = "price_aprox_usd"
# The feature matrix is every feature column except the target and the raw place string
X_train = df[["surface_covered_in_m2", "lat", "lon", "neighborhood"]]
y_train = df[target]
To better understand how well the model performs, I will compare it to a baseline mean absolute error (MAE).
# The overall mean will be the baseline
y_mean = y_train.mean()
# One value for each row
y_pred_baseline = [y_mean] * len(y_train)
print("Mean apt price:", round(y_mean, 2))
print("Baseline MAE:", round(mean_absolute_error(y_train, y_pred_baseline), 2))
Mean apt price: 132383.84 Baseline MAE: 44860.11
The average apartment price is around $132k and the baseline MAE is around $45k. That is, if the overall mean were used to predict the price of every apartment, the prediction would be off by about $45k on average.
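MAE is simply the average absolute difference between predictions and true values. A minimal by-hand check of the baseline computation, on three made-up prices:

```python
import numpy as np

y_true = np.array([100_000.0, 150_000.0, 200_000.0])
y_mean = y_true.mean()                  # baseline: predict the mean for every row
mae = np.mean(np.abs(y_true - y_mean))  # average absolute error of that baseline
print(y_mean, round(mae, 2))
```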
For the model pipeline I will use OneHotEncoder to handle the categorical neighborhood column and SimpleImputer for missing values. Before fitting, it is worth checking how many unique neighborhoods there are.
df.neighborhood.nunique()
57
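With 57 unique neighborhoods, the encoder will expand that single column into 57 binary columns. A minimal sketch of the same idea using pandas' built-in get_dummies (not the category_encoders class used below), on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"neighborhood": ["Chacarita", "Caballito", "Once", "Chacarita"]})
# One binary column per unique neighborhood
encoded = pd.get_dummies(toy, columns=["neighborhood"])
print(list(encoded.columns))
# ['neighborhood_Caballito', 'neighborhood_Chacarita', 'neighborhood_Once']
```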
model = make_pipeline(OneHotEncoder(use_cat_names=True), SimpleImputer(), LinearRegression())
model.fit(X_train, y_train)
Pipeline(steps=[('onehotencoder',
OneHotEncoder(cols=['neighborhood'], use_cat_names=True)),
('simpleimputer', SimpleImputer()),
('linearregression', LinearRegression())])
intercept = model.named_steps['linearregression'].intercept_
coefficients = model.named_steps['linearregression'].coef_
print("coefficients len:", len(coefficients))
print(coefficients[:5]) # First five coefficients
coefficients len: 60 [ 2208.54913208 247000.81749086 148752.4894247 561.92516784 13068.13103889]
The 60 coefficients correspond to the three numeric features (surface_covered_in_m2, lat, lon) plus the 57 one-hot-encoded neighborhoods. Next, I calculate the training mean absolute error by comparing the model's predictions to the true targets in y_train.
# Predicting values
y_pred_training = model.predict(X_train)
# How good is its MAE
print("Training MAE:", round(mean_absolute_error(y_train, y_pred_training), 2))
Training MAE: 24237.29
The model beats the baseline MAE by around $20k, which is a good sign.
How best to communicate the results depends on where the model will be applied. Here, however, I will expose it through an interactive widget where the user inputs the information and the predicted price is displayed.
def make_prediction(area, lat, lon, neighborhood):
    """
    Feed the area, latitude, longitude and neighborhood of an apartment into the linear regression model.
    Returns the predicted apartment price.
    Parameters
    ----------
    area : float
        Size of the apartment in square meters
    lat : float
        Latitude of the apartment
    lon : float
        Longitude of the apartment
    neighborhood : str
        Neighborhood where the apartment is located
    """
    # Create a dictionary with the received info
    data = {
        "surface_covered_in_m2": area,
        "lat": lat,
        "lon": lon,
        "neighborhood": neighborhood,
    }
    # Create a one-row DataFrame from the dictionary
    df = pd.DataFrame(data, index=[0])
    # Make a prediction
    prediction = model.predict(df).round(2)[0]
    # Return the predicted value
    return f"Predicted apartment price: ${prediction}"
# Using the function for the first time
make_prediction(110, -34.60, -58.46, "Villa Crespo")
'Predicted apartment price: $248868.18'
Creating the interactive widget
# Invoke the interact method
interact(
make_prediction, # Pass it the predict function
area=IntSlider( # Start and end of the area slider
min=X_train["surface_covered_in_m2"].min(),
max=X_train["surface_covered_in_m2"].max(),
value=X_train["surface_covered_in_m2"].mean(),
),
lat=FloatSlider( # Start and end of the latitude slider
min=X_train["lat"].min(),
max=X_train["lat"].max(),
step=0.01,
value=X_train["lat"].mean(),
),
lon=FloatSlider( # Start and end of the longitude slider
min=X_train["lon"].min(),
max=X_train["lon"].max(),
step=0.01,
value=X_train["lon"].mean(),
),
neighborhood=Dropdown(options=sorted(X_train["neighborhood"].unique())), # Dropdown with all the neighborhoods
);
interactive(children=(IntSlider(value=53, description='area', max=101, min=30), FloatSlider(value=-34.59890626…

# Get the model coefficients
coefficients = model.named_steps['linearregression'].coef_
# Get feature names
feature_names = model.named_steps["onehotencoder"].get_feature_names()
# How many features
print("features len:", len(feature_names))
print(feature_names[:5])
features len: 60 ['surface_covered_in_m2', 'lat', 'lon', 'neighborhood_Chacarita', 'neighborhood_Villa Luro']
# Pair each coefficient with its feature name in a Series
feat_imp = pd.Series(coefficients, index=feature_names)
feat_imp.head()
surface_covered_in_m2 2208.549132 lat 247000.817491 lon 148752.489425 neighborhood_Chacarita 561.925168 neighborhood_Villa Luro 13068.131039 dtype: float64
# Plot the features with the highest importances
feat_imp.sort_values(key=abs).tail(15).plot(kind='barh')
# Label the plot
plt.xlabel('Importance [USD]')
plt.ylabel('Feature')
plt.title('Feature Importance for Apartment Price')
plt.show()